Social Media from Conflict Zones

Text analysis of social media posts from Syria

Michael L. Davies
mld9s

Overview

What can we learn about conflict environments from text analysis of social media posts? I assume that social media posts from within a conflict zone reflect the environment. Of course, sentiments will likely ebb and flow, but they are likely more negative in nature. Additionally, I wonder whether the text can predict (or is associated with) particular events on the ground.

For this project, I leveraged data maintained by The Armed Conflict Location & Event Data Project (ACLED). I narrowed the scope to the Syrian conflict, which has persisted for more than a decade. Since 2017, ACLED has collected more than 80,000 social media and open source posts. The posts are then tagged by a curator with the date, province, city, and various associated events, such as which actor (the Syrian Army or a non-state armed group) gained territory.

Research questions and approach

  1. I expect that the nature of conflict is different in different regions of Syria. So, when conducting sentiment analysis, I control for sentiments by province.
  2. I also plot sentiments over time.
  3. Last, I expect that the text from postings reflects events "on the ground." So, I implement a number of classification models to test whether the text can predict changes in territorial control. As mentioned, the posts are tagged by a curator with data on which actor (the Syrian Army or a non-state armed group) gained territory.

Note: because I'm interested in data science (rather than anthropological) conventions, I chose to diverge from the project description in a few ways.

Interestingly, and possibly because we didn't leverage standard Python libraries for this class, I found R much cleaner in its approach and pipelines, and I found much more interesting results with R.

Paper outline

Section One: Pre-processing and building dataframes

Section Two: Sentiment Analysis

Section Three: Topic Modelling

Section Four: Cluster Analysis

Section Five: Classification - conventional pipelines

Implementation in R is included in separate files.

Findings

Beyond the analysis of parts of speech and word frequencies…

Sentiments/polarity:
Polarity counts vary across provinces. This is to be expected: the conflict has taken on different characteristics in different provinces. In addition, polarity has varied widely over time, and differentially by province. The dominant emotions have been "fear", "anger", and "sadness". These emotions ebbed and flowed over time, but always outpaced "trust" and "joy", regardless of which province we look at. This is later supported by the VADER approach. An important caveat with respect to sentiments: the word "opposition" appears to be coded in the lexicon as a negative emotion. However, it's not always clear that the sentence itself is expressing negativity.
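The per-province polarity counts can be sketched as a simple lexicon lookup. This is a minimal illustration, not the actual pipeline: the toy lexicon and example posts below are invented, whereas the analysis itself used a full emotion lexicon (with VADER as a cross-check).

```python
from collections import Counter

# Toy polarity lexicon (illustrative only; the real analysis used a
# full sentiment/emotion lexicon rather than this hand-picked set)
LEXICON = {
    "shelling": "negative", "killed": "negative", "fear": "negative",
    "liberated": "positive", "aid": "positive", "safe": "positive",
}

def polarity_counts(posts_by_province):
    """Count positive/negative lexicon hits for each province."""
    counts = {}
    for province, posts in posts_by_province.items():
        c = Counter()
        for post in posts:
            for token in post.lower().split():
                if token in LEXICON:
                    c[LEXICON[token]] += 1
        counts[province] = dict(c)
    return counts

# Hypothetical posts, grouped by province
posts = {
    "Idleb": ["shelling killed civilians", "aid convoy arrived"],
    "Aleppo": ["town liberated and safe"],
}
print(polarity_counts(posts))
```

Grouping the counts by province before comparing them is what makes the cross-province variation visible, since raw corpus-wide counts would wash out the regional differences.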

Topic Modeling
The text did not warrant a large number of topics, so I set the number of topics in Python to 10. (With R, I lowered the number of potential topics to 6.) From a heat map of the topic scores, we once again see that topics vary by province.
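The topic-count choice is just a parameter to the LDA fit. A minimal sketch with scikit-learn, using an invented four-post corpus and only 2 topics so the toy example runs (the actual run used `n_components=10` on the full corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in posts; the real corpus is the ACLED text
docs = [
    "airstrikes hit the northern villages overnight",
    "rebels clashed with army units near the highway",
    "aid trucks crossed the border into the province",
    "army shelling continued in the eastern districts",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)

# n_components=10 in the actual analysis; 2 here for the toy corpus
lda = LatentDirichletAllocation(n_components=2, random_state=42)
theta = lda.fit_transform(dtm)   # document-topic weights (THETA)
phi = lda.components_            # topic-term weights (PHI)
print(theta.shape, phi.shape)
```

Averaging the rows of `theta` within each province is one way to produce the province-by-topic heat map described above.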

Classification
My primary interest was: can we use text from social media to predict the outcome of a battle? As such, my response variable is binary, where the social media text is associated with either "Syrian regime regains territory" or "Non-state actor gains territory", recoded as 1 and 0 respectively. I first filter the data to only those cases coded according to the response variable. (Many posts are associated only with various types of clashes and battles with no turnover in territory.) Fortunately, the data is balanced across labels, so no up/down-sampling is required and accuracy is sufficient for evaluating the models.
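The filtering and recoding step can be sketched with pandas. The column names and example rows below are hypothetical; only the two event labels mirror the ACLED tags described above.

```python
import pandas as pd

# Hypothetical frame; "notes" and "event" are assumed column names
df = pd.DataFrame({
    "notes": ["army retook the town", "rebels seized the checkpoint",
              "clashes with no change", "regime regains district"],
    "event": ["Syrian regime regains territory",
              "Non-state actor gains territory",
              "Armed clash",
              "Syrian regime regains territory"],
})

# Keep only posts with a territorial turnover, then binarize the label
keep = ["Syrian regime regains territory", "Non-state actor gains territory"]
df = df[df["event"].isin(keep)].copy()
df["y"] = (df["event"] == "Syrian regime regains territory").astype(int)
print(df["y"].tolist())   # regime gains -> 1, non-state gains -> 0
```

Checking `df["y"].value_counts()` at this point is how one verifies the class balance that makes plain accuracy a reasonable metric.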

I then compare multiple classification algorithms.

I begin with the simplest model: sklearn's feature extraction package handles the text preprocessing before passing it to the logistic regression. Additionally, I import the module for the train/test split (set at 75/25).

In Python, I compete three classification approaches.

First, I leverage spaCy for preprocessing and added the TF-IDF module with the sklearn for classification. Second, I use a base line model, which is the simplest, form sklearn without TF-IDF. Last, I implemented a Keras neural network, which is a high-level language that sits on top of TensorFlow.

In the end, prediction was extremely robust, with all models achieve accuracy between 92 and 94 percent. Interestingly, Keras (TensorFlow) was not the most accurate.

Findings in R

Given that this is not required for this project, I won't elaborate on findings here. However, I was able to implement a much more sophisticated treatment of uni-grams and bigrams. As well, I conducted a network analysis of words associated with each label, (a cleaner) topic analysis, frequencies of co-occurances with words of interest. Last, I conducted a bootstrapped logistic regression to predict the labels. For this, I plotted a variables importance plot to show which words (unigrams or bigrams) were the most important in predicting the response.

Data

Configure the OHCO

Initial Exploratory Analysis of tokens

With unprocessed data

Pre-process data

POS Max

Bag o' Words

Document-Term Matrix

TF-IDF

Sentiment Analysis

Explore Sentiment in at Sentence Level

VADER

Positive and Negative

Neutral

Compound - combination of pos and neg

Topic Modeling

Configs

We use Scikit Learn's CountVectorizer to convert our F1 corpus of paragraphs into a document-term vector space of word counts.

THETA

PHI

Inspect Results

Get Top Terms per Topic

Sort Topics by Doc Weight

Explore Topics by Province

Sorted descending in terms of Idleb province, which seem to be the most volitile province.

Comparing Idleb, which is dominated by hardline Arab opposition factions (largely backed by Turkey) with Deir ez Zor province, which is Kurd dominated opposition (largely backed by the US).

Clutser Topics

Build VOCAB and LIB

Classification

Logistic Regression with train/test split

In Python, I compete three classification approaches. First, I use a base line model, which is the simplest, form sklearn. Second, I leverage spaCy for preprocessing and added the TF-IDF module with the sklearn for classification. Last, I implemented a Keras neural network, which is a high-level language that sits on top of TensorFlow.

spaCy

Defining a Custom Transformer

Vectorization Feature Engineering (TF-IDF)

Creating a Pipeline and Generating the Model

Without TF-IDF

Neural Networks with Keras

Evaluate the accuracy of the model: